LLM in ERP without inflating the OpenAI bill: 7 architecture patterns (RAG, caching, on-prem)
A deep technical guide to integrating LLMs into enterprise systems with predictable cost. RAG patterns, prompt caching, model routing, self-hosting, function calling — with real cost benchmarks.
The typical story: LLM in ERP, the bill explodes in 90 days
A company hires a software house, says "we want ChatGPT in our ERP". The vendor ships a beautiful POC. Everything works. Approved. After 90 days the production flow is running. Month 3 OpenAI bill: $14,000. The CTO asks "is this normal?". The vendor says "it's because it scaled". The company starts looking for alternatives.
This is the pattern. Integrating LLM in ERP is easy. Integrating LLM in ERP with predictable, sustainable cost is serious architecture.
This post is the dense blueprint of the 7 architecture patterns that separate "amateur-grade production" LLM integration from "sustainable production", with real cost benchmarks, model decisions and implementation examples. Read also LLM integrado em sistemas empresariais: como fazemos na Bradata (Portuguese).
The basic math: why cost scales badly by default
LLM cost = number of tokens × price per token. Each request "naively implemented" consumes:
- System prompt (instructions) — 800 to 3,000 tokens
- RAG context — 1,500 to 8,000 tokens
- User question — 50 to 300 tokens
- Response — 200 to 1,500 tokens
Typical total: 3,000 to 12,000 tokens per interaction. For Claude Sonnet or GPT-4o ($3/M input, $15/M output as of May 2026):
- Average request = $0.015 to $0.06
- 10,000 requests/day = $4,500 to $19,000/month
For a company with 200 active users making 5 requests/day = 30,000/day = $13,500 to $57,000/month. And small usage growth turns into bills that are impossible to absorb.
The 7 patterns below reduce cost between 60% and 95% with minimal or no quality loss.
Pattern 1 — RAG with smart indexing (don't dump the entire PDF into the prompt)
Common mistake: user asks "what's the bid response deadline for tender 122/2026?". System reads the entire PDF (200 pages, 80k tokens) and dumps it into the prompt. Right answer, absurd cost.
The right pattern: RAG (Retrieval-Augmented Generation) implemented well:
- Smart chunking: split the document into semantically coherent pieces (not blind 500-token chunks). Good practice: chunk by document section, with configurable overlap (200 tokens) and metadata (page, section, type).
- Embeddings: generate vectors of each chunk with a cheap embedding model (text-embedding-3-small from OpenAI, $0.00002 per 1k tokens, or voyage-3-lite).
- Vector storage: PostgreSQL with pgvector (open source, performant up to 10M chunks), Pinecone, Qdrant or Weaviate.
- Retrieval: for each query, retrieve top-K most-similar chunks (K = 4–8 typically).
- Re-ranking (optional, for sensitive cases): apply a re-rank model (Cohere Rerank, voyage-3-rerank) over the initial K to pick the best 3–4.
Result: prompt goes from 80k tokens (whole PDF) to 4k tokens (relevant chunks). Cost drops 20x.
Query → Embedding (40 tokens) → Vector search (sub-second) → Top-K chunks (3k tokens) → LLM (4k tokens) → Answer
This is the central pattern of our VisionApp RAG engine and the Copilot in the Hub.
Pattern 2 — Model routing (right model for the right task)
Most applications use a single model for everything. Costly mistake. Different models have drastically different costs:
| Model | Input ($/M tok) | Output ($/M tok) | Best for... |
|---|---|---|---|
| Claude Haiku 4.5 | $0.80 | $4 | Classification, short summarization, factual Q&A |
| GPT-4o-mini | $0.15 | $0.60 | Fast tasks, classification, structured JSON |
| Llama 3.1 70B (Groq) | $0.59 | $0.79 | Summarization, factual answers, code completion |
| Claude Sonnet 4.6 | $3 | $15 | Complex reasoning, long-form writing, analysis |
| GPT-4o | $2.50 | $10 | Multimodal, reasoning, structured JSON |
| Claude Opus 4.7 | $15 | $75 | Cases where you need the best (strategy, legal analysis) |
Architectural pattern: explicit routing per task.
Incoming query
↓
Classifier (Haiku, $0.001)
↓
- Simple factual question → Haiku ($0.01)
- Summarization → Llama 3.1 70B ($0.02)
- Complex analysis → Sonnet ($0.15)
- Critical legal decision → Opus ($0.50)
In production, 70–85% of requests fall into "simple" + "summarization" and only 3–8% need the top model. Average cost drops 60–80% vs "everything in Opus/GPT-4o".
Pattern 3 — Prompt caching (up to 90% discount on repeated content)
Claude and GPT-4o now support prompt caching — you mark portions of the prompt that repeat (system instructions, fixed customer context, product documentation) and the provider caches internally. Cache reads cost 10% of normal cost.
Typical application: your ERP system prompt has 4,000 tokens (behavior instructions, examples, business rules). That prompt goes in every interaction. Without caching, it costs 4,000 tokens × thousands of calls.
With caching:
- First request: 4,000 tokens at normal rate
- Subsequent: 4,000 tokens × 10% ($0.0004 vs $0.004 per request)
- Cache lives 5 minutes in Claude, 5–60 minutes in OpenAI
For a production system with 10k requests/day, this single optimization saves $300–$1,600/month.
Pattern 4 — Batching for non-interactive tasks
Many ERP LLM tasks are not interactive — they're background jobs: classify 500 new tenders, extract attributes from 2,000 new catalog products, generate summaries of 1,000 emails.
For these, use Batch API (Anthropic and OpenAI offer it):
- Submit a batch of up to 50,000 requests in one file
- Provider processes within 24 hours (typically 2–6h)
- Cost is 50% of normal price
For a batch run of 10,000 docs:
- Interactive: $48
- Batch: $24
50% off is just the headline. Combined with Model Routing + Caching, batch processing can be 75–88% cheaper than interactive.
Pattern 5 — Self-hosting for high-frequency queries
For internal use, high frequency and when privacy matters (sensitive data, regulated), self-hosting open-source models becomes competitive.
In 2026, reliable open models for production:
| Model | Size | Min GPU | Performance |
|---|---|---|---|
| Llama 3.1 8B (Q4) | 4.6GB | A10 / RTX 4090 | Good for simple tasks |
| Llama 3.1 70B (Q4) | 40GB | 2x A100 80GB | Close to GPT-4o-mini |
| Qwen 2.5 72B | 41GB | 2x A100 80GB | Excellent for Chinese, good overall |
| Llama 3.1 405B | 230GB | 8x H100 | Comparable to Claude Sonnet |
| DeepSeek V3 671B | 380GB | 8x H100 | Comparable to Claude Sonnet, even cheaper |
Cost to serve Llama 3.1 70B in production:
- AWS p4d.24xlarge (8x A100): $32.77/hour = ~$25,000/month running 24/7
- vs OpenAI: equivalent usage exceeds $38,000/month at the same volume
Break-even: when usage > 30M tokens/day, self-hosting pays off. Below that, managed API is better (cost + ops).
Practical implementation: Ollama for development, vLLM for production (high throughput, automatic batching), TGI (Text Generation Inference) from Hugging Face as an alternative.
For Bradata clients in regulated segments (health, government, financial), we offer pluggable models: the client chooses API or self-hosting based on compliance and budget.
Pattern 6 — Structured function calling instead of text prompts
LLMs know how to use tools/functions structurally. Instead of asking for a response in text and parsing it later (costly, error-prone), you define functions and the LLM calls exactly what it needs.
Common mistake:
Prompt: "Respond in JSON with {name, taxId, amount}. Do not include extra text. Question: who is the client of tender X?"
Right pattern:
tools = [{
"name": "lookup_client",
"parameters": {
"type": "object",
"properties": {
"tax_id": {"type": "string"},
"include_history": {"type": "boolean"}
}
}
}]
response = llm.create(messages=[...], tools=tools)
# Response comes back as a structured tool_call, schema-guaranteed
Function calling has schema guarantee (Anthropic and OpenAI validate). Fewer tokens (no format instructions), zero parse errors, cheaper.
Pattern 7 — Cost observability and gatekeeping
Without observability, cost runs away without warning. Minimum stack:
Mandatory metrics
- Requests per day, per user, per task type
- Tokens consumed (input + output) per model
- Cost in USD per day
- Latency p50, p95, p99
- Error rate (timeouts, rate limits, empty model returns)
Tools
- Langfuse (open source, self-hosted) — complete LLM observability, traces, evals
- Helicone (SaaS) — proxy + observability
- PostHog (already in many companies) — LLM events plus general analytics
Gatekeeping
- Per-user rate limit (10 calls/minute, 200/day)
- Per-tenant budget cap (monthly)
- Cost alert (notify if daily threshold exceeded)
- Kill switch if cost spikes (LLM goes temporarily offline pending human review)
Without this, a bug or abusive use can generate $10k in fees overnight.
Comparative scenario: each pattern stacked
Mid-size ERP with 200 users, 5 requests/day average = 30k requests/day. Cost evolves with each pattern applied:
| Scenario | Monthly cost (USD) | Performance |
|---|---|---|
| Naïve (GPT-4o for everything, no RAG, no cache) | $43,000 | OK, but slow |
| + RAG | $17,000 (-60%) | Same |
| + Model routing | $7,200 (-58%) | Same or better |
| + Prompt caching | $4,200 (-42%) | Same |
| + Function calling | $3,500 (-17%) | Better (fewer errors) |
| + Batching of background jobs | $2,500 (-29%) | Same (batch latency ok) |
| + Self-hosting Llama 3.1 70B for high-frequency queries | $1,800 (-28%) | Same |
| TOTAL with all patterns | $1,800/month (-96%) | Same or better |
$43k down to $1.8k. 24× cheaper with disciplined architecture.
The trap: beware of over-engineering
Before implementing all 7 patterns at once, calibrate by volume:
| Daily volume | Minimum patterns |
|---|---|
| < 100 requests | Use API only. Single model OK. Observability. |
| 100–1,000 | + RAG + Caching. Model routing optional. |
| 1,000–10,000 | All except self-hosting (cost already fits the API) |
| 10,000–50,000 | All, with aggressive batching |
| > 50,000 | All + self-hosting for high-frequency queries |
A company starting with 200 requests/day implementing "Pattern 5 — self-hosting with 8x A100" from the start? Makes no sense. Wait for volume to justify it.
Compliance and privacy — where the pattern changes
For clients in healthcare (sensitive data), financial (BCB), government (secrecy), and legal (professional confidentiality), the equation shifts:
- Data cannot leave for OpenAI/Anthropic (without a processing agreement)
- Models with no-training clauses (Anthropic Enterprise, OpenAI Enterprise via Azure) may serve
- Self-hosting becomes preferred even when not the cheapest
For Bradata clients in these segments, we offer:
- Azure OpenAI (OpenAI model running on Azure, data does not leave tenant)
- Anthropic via Bedrock (data stays in the client's AWS account)
- Self-hosting Llama/Qwen/DeepSeek on dedicated GPUs
- Pseudonymization before any external call when applicable
Conclusion
Integrating LLM in ERP without architecture is easy. Integrating sustainably requires technical discipline. The 7 patterns above reduce cost between 60% and 96% with the same or better quality.
If you're planning to integrate LLM into your enterprise system, talk to us. Bradata has been building enterprise systems with integrated AI since 2023.
Bradata is a Brazilian software house with deep expertise in enterprise software with embedded AI. See our solutions and cases.
Sources: Anthropic API Pricing (May 2026), OpenAI API Pricing (May 2026), Groq Cloud Pricing, AWS GPU Instance Pricing (May 2026), Hugging Face Open LLM Leaderboard, internal Bradata projects, "Building LLM Applications for Production" — Chip Huyen 2024, MLPerf Inference Benchmark 2025.