Practical engineering and governance notes from the Northern Virginia tech corridor.
How I Build Model-Substitutable AI Integrations (And Why Most Enterprise Teams Don't)
A vendor-neutral pattern for integrating LLMs into enterprise applications — with code, trade-offs, and what to budget for.
About this post: I work on enterprise IT in the Northern Virginia tech corridor. Most of what I write about is the unglamorous, governance-heavy side of AI rollouts. This post is the opposite — it is the technical pattern I actually use when integrating LLMs into business applications. Code is in Python because that is what most of the integrations I work on are written in. The pattern translates to other languages with no meaningful changes.
Every team I have worked with on an enterprise AI integration has, at some point, written code that looks like this:

    import openai

    def summarize(text: str) -> str:
        response = openai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Summarize the following text in 2-3 sentences."},
                {"role": "user", "content": text}
            ],
            temperature=0.3
        )
        return response.choices[0].message.content

This works. It ships. It also — and I want to be direct about this — guarantees that switching providers in twelve months will require rewriting every place this pattern appears in your codebase. For a small project, that is acceptable. For an enterprise application where LLM calls are scattered across dozens of services, it is a slow-motion lock-in disaster.
Here is the pattern I use instead.
The actual problem
The problem is not "OpenAI vs Anthropic vs Google." The problem is that each provider's SDK has subtly different concepts:
Message structure differs. OpenAI uses messages with role fields. Anthropic separates system from messages. Google Gemini uses contents with parts. Cohere uses chat_history.
Parameter names differ. max_tokens vs max_output_tokens vs maxOutputTokens. temperature vs temperature but with different valid ranges (0-1 for some, 0-2 for others).
Response shapes differ. Choices arrays, content blocks, candidates: every provider returns a different shape.
Streaming protocols differ. SSE event shapes vary; some providers chunk by token, others by sentence.
Error semantics differ. Rate limits, content filters, context-length errors all surface differently.
If you write directly against any one SDK, you have written code that knows about all of these specifics. Migrating it means changing every assumption, in every file.
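To make the message-structure and parameter-name differences concrete, here is a small sketch that builds the same one-turn request in three shapes. The function names are hypothetical, the model field is left out for brevity, and the field names reflect each provider's public chat API as I understand it, so check current documentation before relying on any of them.

```python
# Illustrative only: the same system + user prompt in three payload shapes.
# The to_* helper names are made up for this example.

def to_openai(system: str, user: str, max_tokens: int) -> dict:
    # The system prompt travels inside the messages list.
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
    }


def to_anthropic(system: str, user: str, max_tokens: int) -> dict:
    # The system prompt is a top-level field, separate from messages.
    return {
        "system": system,
        "messages": [{"role": "user", "content": user}],
        "max_tokens": max_tokens,
    }


def to_gemini(system: str, user: str, max_tokens: int) -> dict:
    # Text is nested under contents -> parts, and the token limit is a
    # camelCase key inside a separate generation config object (REST shape).
    return {
        "systemInstruction": {"parts": [{"text": system}]},
        "contents": [{"role": "user", "parts": [{"text": user}]}],
        "generationConfig": {"maxOutputTokens": max_tokens},
    }
```

Three functions for one trivial request is the whole problem in miniature: none of these dictionaries can be handed to another provider unchanged.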
The pattern: a thin internal abstraction
The fix is not a heavy framework. It is a thin abstraction layer that captures only what your application actually uses, expressed in your application's vocabulary.
Here is what mine looks like, stripped to essentials:

    # llm/types.py
    from dataclasses import dataclass
    from typing import Literal, Optional, Iterator
    from enum import Enum

    class Role(Enum):
        SYSTEM = "system"
        USER = "user"
        ASSISTANT = "assistant"

    @dataclass
    class Message:
        role: Role
        content: str

    @dataclass
    class CompletionRequest:
        messages: list[Message]
        max_tokens: int = 1024
        temperature: float = 0.3
        stop_sequences: Optional[list[str]] = None

    @dataclass
    class CompletionResponse:
        content: str
        input_tokens: int
        output_tokens: int
        model: str
        finish_reason: Literal["complete", "max_tokens", "stop", "content_filter", "error"]

That is the contract. Notice what is not in it: no provider-specific concepts, no SDK types, no leaked abstractions. Your application code speaks this vocabulary and only this vocabulary.
The provider implementations sit behind a single interface:

    # llm/client.py
    from abc import ABC, abstractmethod
    from typing import Iterator  # needed for the stream() return annotation

    from llm.types import CompletionRequest, CompletionResponse

    class LLMClient(ABC):
        @abstractmethod
        def complete(self, request: CompletionRequest) -> CompletionResponse:
            ...

        @abstractmethod
        def stream(self, request: CompletionRequest) -> Iterator[str]:
            ...

And then one adapter per provider. The adapters are the only place provider-specific logic exists.

    # llm/providers/openai_client.py
    from typing import Iterator

    import openai

    from llm.client import LLMClient
    from llm.types import CompletionRequest, CompletionResponse

    class OpenAIClient(LLMClient):
        def __init__(self, api_key: str, model: str = "gpt-4o"):
            self.client = openai.OpenAI(api_key=api_key)
            self.model = model

        def _to_messages(self, request: CompletionRequest) -> list[dict]:
            # OpenAI keeps the system prompt inside the messages list.
            return [
                {"role": m.role.value, "content": m.content}
                for m in request.messages
            ]

        def complete(self, request: CompletionRequest) -> CompletionResponse:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=self._to_messages(request),
                max_tokens=request.max_tokens,
                temperature=request.temperature,
                stop=request.stop_sequences
            )
            choice = response.choices[0]
            return CompletionResponse(
                # content can be None (for example on a filtered response)
                content=choice.message.content or "",
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens,
                model=response.model,
                finish_reason=self._normalize_finish(choice.finish_reason)
            )

        def _normalize_finish(self, reason: str) -> str:
            # OpenAI reports "stop" for both a natural end and a stop sequence.
            return {
                "stop": "complete",
                "length": "max_tokens",
                "content_filter": "content_filter"
            }.get(reason, "error")

        def stream(self, request: CompletionRequest) -> Iterator[str]:
            stream = self.client.chat.completions.create(
                model=self.model,
                messages=self._to_messages(request),
                max_tokens=request.max_tokens,
                temperature=request.temperature,
                stop=request.stop_sequences,  # keep streaming consistent with complete()
                stream=True
            )
            for chunk in stream:
                # Some chunks carry no choices or no text; skip them.
                if chunk.choices and chunk.choices[0].delta.content:
                    yield chunk.choices[0].delta.content

The Anthropic adapter looks structurally identical but handles the system-message separation and the content-block response shape:

    # llm/providers/anthropic_client.py
    import anthropic

    from llm.client import LLMClient
    from llm.types import CompletionRequest, CompletionResponse, Role

    class AnthropicClient(LLMClient):
        def __init__(self, api_key: str, model: str = "claude-opus-4-7"):
            self.client = anthropic.Anthropic(api_key=api_key)
            self.model = model

        def complete(self, request: CompletionRequest) -> CompletionResponse:
            # Anthropic takes the system prompt as a top-level field, not a message.
            system_msg = next(
                (m.content for m in request.messages if m.role == Role.SYSTEM),
                None
            )
            conversation = [
                {"role": m.role.value, "content": m.content}
                for m in request.messages if m.role != Role.SYSTEM
            ]
            # Leave optional fields out when unset instead of passing None;
            # the SDK signature marks them as omittable, not nullable.
            optional = {}
            if system_msg is not None:
                optional["system"] = system_msg
            if request.stop_sequences:
                optional["stop_sequences"] = request.stop_sequences
            response = self.client.messages.create(
                model=self.model,
                messages=conversation,
                max_tokens=request.max_tokens,
                temperature=request.temperature,
                **optional
            )
            return CompletionResponse(
                # Assumes the first content block is a text block.
                content=response.content[0].text,
                input_tokens=response.usage.input_tokens,
                output_tokens=response.usage.output_tokens,
                model=response.model,
                finish_reason=self._normalize_finish(response.stop_reason)
            )

        def _normalize_finish(self, reason: str) -> str:
            return {
                "end_turn": "complete",
                "max_tokens": "max_tokens",
                "stop_sequence": "stop"
            }.get(reason, "error")

        def stream(self, request: CompletionRequest):
            # similar pattern, handling Anthropic's event stream
            ...

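For the streaming side, one possible shape is sketched below as a standalone helper rather than as the adapter method itself. It assumes the Anthropic Python SDK's messages.stream() context manager and its text_stream iterator; the helper name stream_anthropic_text and its parameter list are hypothetical, and the client is passed in so the snippet does not need the SDK installed to be read or tested.

```python
from typing import Any, Iterator


def stream_anthropic_text(
    client: Any,  # assumed to be an anthropic.Anthropic instance
    model: str,
    messages: list[dict],
    max_tokens: int,
    temperature: float,
) -> Iterator[str]:
    """Yield text fragments as they arrive, hiding the event types."""
    # messages.stream() is used as a context manager so the connection is
    # closed even if the consumer stops iterating early.
    with client.messages.stream(
        model=model,
        messages=messages,
        max_tokens=max_tokens,
        temperature=temperature,
    ) as stream:
        # text_stream yields only text deltas, which matches the
        # Iterator[str] contract of LLMClient.stream().
        for text in stream.text_stream:
            yield text
```

Inside an adapter, the system prompt and stop sequences would be passed the same way as in complete().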
Now your application code looks like this:

    # app/summarize.py
    from llm.client import LLMClient
    from llm.types import CompletionRequest, Message, Role

    def summarize(client: LLMClient, text: str) -> str:
        request = CompletionRequest(
            messages=[
                Message(Role.SYSTEM, "Summarize the following text in 2-3 sentences."),
                Message(Role.USER, text)
            ],
            temperature=0.3
        )
        response = client.complete(request)
        return response.content

The application code does not know or care which provider is behind client. Swap providers by changing one wiring line at startup.
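The "one wiring line" can be sketched as a small factory. Everything named here is an assumption for illustration: the LLM_PROVIDER environment variable, the build_client function, and the EchoClient stand-in, which exists only so the snippet runs without any SDK installed.

```python
import os
from typing import Callable, Optional, Protocol


class LLMClient(Protocol):
    # Stand-in for the LLMClient interface in llm/client.py.
    def complete(self, request: object) -> object: ...


class EchoClient:
    # Hypothetical offline adapter, useful for tests and local development.
    def complete(self, request: object) -> object:
        return request


# Registry of zero-argument factories. Real entries would construct
# OpenAIClient or AnthropicClient with keys read from the environment.
_FACTORIES: dict[str, Callable[[], LLMClient]] = {
    "echo": EchoClient,
}


def build_client(provider: Optional[str] = None) -> LLMClient:
    """Pick the adapter once, at startup, from config or an explicit name."""
    name = provider or os.environ.get("LLM_PROVIDER", "echo")
    try:
        return _FACTORIES[name]()
    except KeyError:
        raise ValueError(f"unknown LLM provider: {name!r}") from None
```

Application code then receives the result of build_client() through whatever dependency-injection style the codebase already uses; nothing downstream imports a provider SDK.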
Where teams push back, and why I do this anyway
Reasonable engineers raise three objections to this pattern. They are worth taking seriously.
"This is over-engineering. YAGNI."
The YAGNI argument is the strongest one. For a single-feature integration in a hackathon project, yes, write directly against the SDK. For an enterprise application where AI features will spread across many services over multiple years, the abstraction pays for itself the first time you need to: switch providers for cost reasons, add a second provider for redundancy, route different requests to different providers based on data classification, swap to an on-prem model for sensitive workloads, or comply with a new procurement requirement.
I have watched all five of these happen at companies in the past two years. Teams that built behind an abstraction adapted in days. Teams that did not are still mid-migration.
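Adding a second provider for redundancy is a good illustration of why the interface matters. Below is a minimal failover sketch: FallbackClient is a hypothetical name, the types are trimmed stand-ins for the ones in llm/types.py, and catching bare Exception is a simplification that a real implementation would narrow to transient provider errors.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


@dataclass
class CompletionResponse:
    # Trimmed stand-in for the response type in llm/types.py.
    content: str
    model: str


class LLMClient(Protocol):
    def complete(self, request: object) -> CompletionResponse: ...


class FallbackClient:
    """Try each client in order and return the first successful response."""

    def __init__(self, clients: Sequence[LLMClient]):
        if not clients:
            raise ValueError("at least one client is required")
        self._clients = list(clients)

    def complete(self, request: object) -> CompletionResponse:
        last_error: Optional[Exception] = None
        for client in self._clients:
            try:
                return client.complete(request)
            except Exception as exc:  # simplification: narrow this in real code
                last_error = exc
        # Every provider failed; surface the last underlying error as the cause.
        raise RuntimeError("all providers failed") from last_error
```

Because FallbackClient exposes the same complete() method, callers cannot tell it apart from a single-provider adapter.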
"Provider-specific features are valuable. I lose them with an abstraction."
Partially true. The abstraction captures the common 80% of functionality. Provider-specific features (Anthropic's prompt caching, OpenAI's function calling specifics, Gemini's grounding) need either provider-specific escape hatches in the interface or specialized clients used directly where needed.
My rule: if a feature is core to the application, build it into the abstraction with provider-specific implementations behind it. If it is experimental or used in one place, allow direct SDK access for that one use case and accept it as an explicit migration cost later.
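One way to express an escape hatch without leaking SDK types is a per-provider options bag on the request. This is a sketch under stated assumptions: the provider_options field and the build_kwargs helper are hypothetical additions, not part of the contract shown earlier, and the request type is trimmed to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CompletionRequest:
    # Trimmed stand-in for the request type in llm/types.py.
    prompt: str
    max_tokens: int = 1024
    # Hypothetical escape hatch: provider name -> extra keyword arguments.
    provider_options: dict[str, dict[str, Any]] = field(default_factory=dict)


def build_kwargs(provider: str, request: CompletionRequest) -> dict[str, Any]:
    """Merge portable fields with the options addressed to this provider only."""
    kwargs: dict[str, Any] = {"max_tokens": request.max_tokens}
    # Options meant for other providers are ignored, so the same request
    # object stays valid no matter which adapter receives it.
    kwargs.update(request.provider_options.get(provider, {}))
    return kwargs


request = CompletionRequest("hi", provider_options={"openai": {"seed": 7}})
openai_kwargs = build_kwargs("openai", request)        # includes seed
anthropic_kwargs = build_kwargs("anthropic", request)  # portable fields only
```

The trade-off is explicit: every key placed in provider_options is a line item on the future migration bill, but it is at least greppable in one place.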
"Frameworks like LangChain already do this."
They do. They also bring opinionated abstractions about chains, agents, memory, and tooling that you may not want. For most enterprise use cases I see, a thin internal abstraction is easier to reason about, easier to debug, and has zero external dependencies beyond the underlying provider SDKs.
LangChain has its place — particularly for prototyping and for teams building agentic workflows from scratch. For straightforward request-response LLM integration in business applications, a 200-line internal module is almost always the better answer.
What this enables operationally
Once the abstraction is in place, several capabilities become easy:
Centralized observability. Wrap every complete() call with logging — token usage, latency, model, request hash, response excerpt. Now you have provider-neutral telemetry across every AI feature in the application.
Rate limit and retry handling. Implement once in the base client. Every adapter inherits it.
Cost tracking. Token counts come back in a normalized format. Aggregate them per feature, per user, per tenant — without writing provider-specific accounting code.
A/B testing across providers. Route 10% of requests to a second provider and compare quality, latency, and cost on the same prompts. This is the highest-ROI activity I do with this abstraction — you find out empirically which model fits your workload, instead of guessing from marketing benchmarks.
Data classification routing. Public-tier data goes to the cheapest provider; confidential data routes only to providers with appropriate data handling agreements. The routing logic lives in one place.
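Two of the capabilities above, usage tracking and classification routing, can be sketched as wrappers that implement the same complete() method as any adapter. The class names and the feature and classification request fields are hypothetical, and the types are trimmed stand-ins for those in llm/types.py.

```python
import time
from collections import defaultdict
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CompletionRequest:
    prompt: str
    feature: str = "unknown"        # hypothetical tag for cost attribution
    classification: str = "public"  # hypothetical data classification label


@dataclass
class CompletionResponse:
    content: str
    input_tokens: int
    output_tokens: int
    model: str


class LLMClient(Protocol):
    def complete(self, request: CompletionRequest) -> CompletionResponse: ...


def _empty_stats() -> dict:
    return {"calls": 0, "input_tokens": 0, "output_tokens": 0, "seconds": 0.0}


class InstrumentedClient:
    """Wraps any client and accumulates per-feature token and latency totals."""

    def __init__(self, inner: LLMClient):
        self._inner = inner
        self.usage = defaultdict(_empty_stats)

    def complete(self, request: CompletionRequest) -> CompletionResponse:
        start = time.perf_counter()
        response = self._inner.complete(request)
        stats = self.usage[request.feature]
        stats["calls"] += 1
        stats["input_tokens"] += response.input_tokens
        stats["output_tokens"] += response.output_tokens
        stats["seconds"] += time.perf_counter() - start
        return response


class ClassificationRouter:
    """Sends each request to the client approved for its data classification."""

    def __init__(self, routes: dict[str, LLMClient]):
        self._routes = routes

    def complete(self, request: CompletionRequest) -> CompletionResponse:
        try:
            client = self._routes[request.classification]
        except KeyError:
            # Fail closed: unknown classifications never reach a default provider.
            raise ValueError(
                f"no provider approved for {request.classification!r}"
            ) from None
        return client.complete(request)
```

Because both wrappers satisfy the same interface, they compose: a router can hold instrumented clients, and an instrumented client can wrap a router.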
What it costs to build
For a Python codebase, this pattern is about 400–600 lines of code total: the types module, the client interface, two or three provider adapters, and tests. Implementation time for an experienced engineer: roughly two days.
For that two-day investment, you get vendor neutrality, centralized observability, and a clean migration path forever. The math has never been close, in my experience.
Where to start
If your codebase already has direct SDK calls scattered through it, do not try to refactor all of them at once. The approach I recommend:
Build the abstraction and the two adapters you care about most
Wrap new AI features behind the abstraction starting immediately
Refactor existing direct calls opportunistically — when you touch the file for another reason
Set a date six months out by which all AI integration goes through the abstraction
Add a lint rule preventing direct SDK imports in application code
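The lint rule in the last step does not need a plugin framework to get started. Here is a minimal sketch using the standard library's ast module; the FORBIDDEN_SDKS set and the find_sdk_imports name are assumptions, and a real check would skip files under llm/providers/ and run in CI or a pre-commit hook.

```python
import ast

# Assumed policy: only the adapter modules may import these SDKs directly.
FORBIDDEN_SDKS = {"openai", "anthropic"}


def find_sdk_imports(source: str) -> list[tuple[int, str]]:
    """Return (line number, module) for each direct SDK import in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import; relative imports are ignored.
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Compare the top-level package so "openai.types" is caught too.
            if module.split(".")[0] in FORBIDDEN_SDKS:
                hits.append((node.lineno, module))
    return sorted(hits)
```

Run over each application file, a non-empty result fails the build and tells the developer exactly which line to move behind the abstraction.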
Within two quarters, your codebase is vendor-neutral by default, and you have an organizational capability that will compound for years.
Adeel Ali is a Technology Manager based in Falls Church, Virginia, working with enterprise IT teams across the Northern Virginia tech corridor on AI governance, cloud strategy, and IT modernization. He writes at adeelali.substack.com and shares templates on GitHub.
