Pular para o conteúdo
B
Bradata
AILLMarchitecture

LLM in ERP without inflating the OpenAI bill: 7 architecture patterns (RAG, caching, on-prem)

A deep technical guide to integrating LLMs into enterprise systems with predictable cost. RAG patterns, prompt caching, model routing, self-hosting, function calling — with real cost benchmarks.

By Bradata··9 min read·Ler em português →

The typical story: LLM in ERP, the bill explodes in 90 days

A company hires a software house, says "we want ChatGPT in our ERP". The vendor ships a beautiful POC. Everything works. Approved. After 90 days the production flow is running. Month 3 OpenAI bill: $14,000. The CTO asks "is this normal?". The vendor says "it's because it scaled". The company starts looking for alternatives.

This is the pattern. Integrating LLM in ERP is easy. Integrating LLM in ERP with predictable, sustainable cost is serious architecture.

This post is the dense blueprint of the 7 architecture patterns that separate "amateur-grade production" LLM integration from "sustainable production", with real cost benchmarks, model decisions and implementation examples. Read also LLM integrado em sistemas empresariais: como fazemos na Bradata (Portuguese).

The basic math: why cost scales badly by default

LLM cost = number of tokens × price per token. Each request "naively implemented" consumes:

  • System prompt (instructions) — 800 to 3,000 tokens
  • RAG context — 1,500 to 8,000 tokens
  • User question — 50 to 300 tokens
  • Response — 200 to 1,500 tokens

Typical total: 3,000 to 12,000 tokens per interaction. For Claude Sonnet or GPT-4o ($3/M input, $15/M output as of May 2026):

  • Average request = $0.015 to $0.06
  • 10,000 requests/day = $4,500 to $19,000/month

For a company with 200 active users making 5 requests/day = 30,000/day = $13,500 to $57,000/month. And small usage growth turns into bills that are impossible to absorb.

The 7 patterns below reduce cost between 60% and 95% with minimal or no quality loss.

Pattern 1 — RAG with smart indexing (don't dump the entire PDF into the prompt)

Common mistake: user asks "what's the bid response deadline for tender 122/2026?". System reads the entire PDF (200 pages, 80k tokens) and dumps it into the prompt. Right answer, absurd cost.

The right pattern: RAG (Retrieval-Augmented Generation) implemented well:

  1. Smart chunking: split the document into semantically coherent pieces (not blind 500-token chunks). Good practice: chunk by document section, with configurable overlap (200 tokens) and metadata (page, section, type).
  2. Embeddings: generate vectors of each chunk with a cheap embedding model (text-embedding-3-small from OpenAI, $0.00002 per 1k tokens, or voyage-3-lite).
  3. Vector storage: PostgreSQL with pgvector (open source, performant up to 10M chunks), Pinecone, Qdrant or Weaviate.
  4. Retrieval: for each query, retrieve top-K most-similar chunks (K = 4–8 typically).
  5. Re-ranking (optional, for sensitive cases): apply a re-rank model (Cohere Rerank, voyage-3-rerank) over the initial K to pick the best 3–4.

Result: prompt goes from 80k tokens (whole PDF) to 4k tokens (relevant chunks). Cost drops 20x.

Query → Embedding (40 tokens) → Vector search (sub-second) → Top-K chunks (3k tokens) → LLM (4k tokens) → Answer

This is the central pattern of our VisionApp RAG engine and the Copilot in the Hub.

Pattern 2 — Model routing (right model for the right task)

Most applications use a single model for everything. Costly mistake. Different models have drastically different costs:

ModelInput ($/M tok)Output ($/M tok)Best for...
Claude Haiku 4.5$0.80$4Classification, short summarization, factual Q&A
GPT-4o-mini$0.15$0.60Fast tasks, classification, structured JSON
Llama 3.1 70B (Groq)$0.59$0.79Summarization, factual answers, code completion
Claude Sonnet 4.6$3$15Complex reasoning, long-form writing, analysis
GPT-4o$2.50$10Multimodal, reasoning, structured JSON
Claude Opus 4.7$15$75Cases where you need the best (strategy, legal analysis)

Architectural pattern: explicit routing per task.

Incoming query
    ↓
Classifier (Haiku, $0.001)
    ↓
- Simple factual question → Haiku ($0.01)
- Summarization → Llama 3.1 70B ($0.02)
- Complex analysis → Sonnet ($0.15)
- Critical legal decision → Opus ($0.50)

In production, 70–85% of requests fall into "simple" + "summarization" and only 3–8% need the top model. Average cost drops 60–80% vs "everything in Opus/GPT-4o".

Pattern 3 — Prompt caching (up to 90% discount on repeated content)

Claude and GPT-4o now support prompt caching — you mark portions of the prompt that repeat (system instructions, fixed customer context, product documentation) and the provider caches internally. Cache reads cost 10% of normal cost.

Typical application: your ERP system prompt has 4,000 tokens (behavior instructions, examples, business rules). That prompt goes in every interaction. Without caching, it costs 4,000 tokens × thousands of calls.

With caching:

  • First request: 4,000 tokens at normal rate
  • Subsequent: 4,000 tokens × 10% ($0.0004 vs $0.004 per request)
  • Cache lives 5 minutes in Claude, 5–60 minutes in OpenAI

For a production system with 10k requests/day, this single optimization saves $300–$1,600/month.

Pattern 4 — Batching for non-interactive tasks

Many ERP LLM tasks are not interactive — they're background jobs: classify 500 new tenders, extract attributes from 2,000 new catalog products, generate summaries of 1,000 emails.

For these, use Batch API (Anthropic and OpenAI offer it):

  • Submit a batch of up to 50,000 requests in one file
  • Provider processes within 24 hours (typically 2–6h)
  • Cost is 50% of normal price

For a batch run of 10,000 docs:

  • Interactive: $48
  • Batch: $24

50% off is just the headline. Combined with Model Routing + Caching, batch processing can be 75–88% cheaper than interactive.

Pattern 5 — Self-hosting for high-frequency queries

For internal use, high frequency and when privacy matters (sensitive data, regulated), self-hosting open-source models becomes competitive.

In 2026, reliable open models for production:

ModelSizeMin GPUPerformance
Llama 3.1 8B (Q4)4.6GBA10 / RTX 4090Good for simple tasks
Llama 3.1 70B (Q4)40GB2x A100 80GBClose to GPT-4o-mini
Qwen 2.5 72B41GB2x A100 80GBExcellent for Chinese, good overall
Llama 3.1 405B230GB8x H100Comparable to Claude Sonnet
DeepSeek V3 671B380GB8x H100Comparable to Claude Sonnet, even cheaper

Cost to serve Llama 3.1 70B in production:

  • AWS p4d.24xlarge (8x A100): $32.77/hour = ~$25,000/month running 24/7
  • vs OpenAI: equivalent usage exceeds $38,000/month at the same volume

Break-even: when usage > 30M tokens/day, self-hosting pays off. Below that, managed API is better (cost + ops).

Practical implementation: Ollama for development, vLLM for production (high throughput, automatic batching), TGI (Text Generation Inference) from Hugging Face as an alternative.

For Bradata clients in regulated segments (health, government, financial), we offer pluggable models: the client chooses API or self-hosting based on compliance and budget.

Pattern 6 — Structured function calling instead of text prompts

LLMs know how to use tools/functions structurally. Instead of asking for a response in text and parsing it later (costly, error-prone), you define functions and the LLM calls exactly what it needs.

Common mistake:

Prompt: "Respond in JSON with {name, taxId, amount}. Do not include extra text. Question: who is the client of tender X?"

Right pattern:

tools = [{
  "name": "lookup_client",
  "parameters": {
    "type": "object",
    "properties": {
      "tax_id": {"type": "string"},
      "include_history": {"type": "boolean"}
    }
  }
}]

response = llm.create(messages=[...], tools=tools)
# Response comes back as a structured tool_call, schema-guaranteed

Function calling has schema guarantee (Anthropic and OpenAI validate). Fewer tokens (no format instructions), zero parse errors, cheaper.

Pattern 7 — Cost observability and gatekeeping

Without observability, cost runs away without warning. Minimum stack:

Mandatory metrics

  • Requests per day, per user, per task type
  • Tokens consumed (input + output) per model
  • Cost in USD per day
  • Latency p50, p95, p99
  • Error rate (timeouts, rate limits, empty model returns)

Tools

  • Langfuse (open source, self-hosted) — complete LLM observability, traces, evals
  • Helicone (SaaS) — proxy + observability
  • PostHog (already in many companies) — LLM events plus general analytics

Gatekeeping

  • Per-user rate limit (10 calls/minute, 200/day)
  • Per-tenant budget cap (monthly)
  • Cost alert (notify if daily threshold exceeded)
  • Kill switch if cost spikes (LLM goes temporarily offline pending human review)

Without this, a bug or abusive use can generate $10k in fees overnight.

Comparative scenario: each pattern stacked

Mid-size ERP with 200 users, 5 requests/day average = 30k requests/day. Cost evolves with each pattern applied:

ScenarioMonthly cost (USD)Performance
Naïve (GPT-4o for everything, no RAG, no cache)$43,000OK, but slow
+ RAG$17,000 (-60%)Same
+ Model routing$7,200 (-58%)Same or better
+ Prompt caching$4,200 (-42%)Same
+ Function calling$3,500 (-17%)Better (fewer errors)
+ Batching of background jobs$2,500 (-29%)Same (batch latency ok)
+ Self-hosting Llama 3.1 70B for high-frequency queries$1,800 (-28%)Same
TOTAL with all patterns$1,800/month (-96%)Same or better

$43k down to $1.8k. 24× cheaper with disciplined architecture.

The trap: beware of over-engineering

Before implementing all 7 patterns at once, calibrate by volume:

Daily volumeMinimum patterns
< 100 requestsUse API only. Single model OK. Observability.
100–1,000+ RAG + Caching. Model routing optional.
1,000–10,000All except self-hosting (cost already fits the API)
10,000–50,000All, with aggressive batching
> 50,000All + self-hosting for high-frequency queries

A company starting with 200 requests/day implementing "Pattern 5 — self-hosting with 8x A100" from the start? Makes no sense. Wait for volume to justify it.

Compliance and privacy — where the pattern changes

For clients in healthcare (sensitive data), financial (BCB), government (secrecy), and legal (professional confidentiality), the equation shifts:

  • Data cannot leave for OpenAI/Anthropic (without a processing agreement)
  • Models with no-training clauses (Anthropic Enterprise, OpenAI Enterprise via Azure) may serve
  • Self-hosting becomes preferred even when not the cheapest

For Bradata clients in these segments, we offer:

  • Azure OpenAI (OpenAI model running on Azure, data does not leave tenant)
  • Anthropic via Bedrock (data stays in the client's AWS account)
  • Self-hosting Llama/Qwen/DeepSeek on dedicated GPUs
  • Pseudonymization before any external call when applicable

Conclusion

Integrating LLM in ERP without architecture is easy. Integrating sustainably requires technical discipline. The 7 patterns above reduce cost between 60% and 96% with the same or better quality.

If you're planning to integrate LLM into your enterprise system, talk to us. Bradata has been building enterprise systems with integrated AI since 2023.

Bradata is a Brazilian software house with deep expertise in enterprise software with embedded AI. See our solutions and cases.


Sources: Anthropic API Pricing (May 2026), OpenAI API Pricing (May 2026), Groq Cloud Pricing, AWS GPU Instance Pricing (May 2026), Hugging Face Open LLM Leaderboard, internal Bradata projects, "Building LLM Applications for Production" — Chip Huyen 2024, MLPerf Inference Benchmark 2025.

Need a tech team now?

Talk to Bradata and get a proposal within 24 business hours.