Practical engineering and governance notes from the Northern Virginia tech corridor.


How I Build Model-Substitutable AI Integrations (And Why Most Enterprise Teams Don't)

A vendor-neutral pattern for integrating LLMs into enterprise applications — with code, trade-offs, and what to budget for.


About this post: I work on enterprise IT in the Northern Virginia tech corridor. Most of what I write about is the unglamorous, governance-heavy side of AI rollouts. This post is the opposite — it is the technical pattern I actually use when integrating LLMs into business applications. Code is in Python because that is what most of the integrations I work on are written in. The pattern translates to other languages with no meaningful changes.


Every team I have worked with on an enterprise AI integration has, at some point, written code that looks like this:

import openai

def summarize(text: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Summarize the following text in 2-3 sentences."},
            {"role": "user", "content": text}
        ],
        temperature=0.3
    )
    return response.choices[0].message.content

This works. It ships. It also — and I want to be direct about this — guarantees that switching providers in twelve months will require rewriting every place this pattern appears in your codebase. For a small project, that is acceptable. For an enterprise application where LLM calls are scattered across dozens of services, it is a slow-motion lock-in disaster.

Here is the pattern I use instead.

The actual problem

The problem is not "OpenAI vs Anthropic vs Google." The problem is that each provider's SDK has subtly different concepts:

  • Message structure differs. OpenAI uses messages with role fields. Anthropic separates system from messages. Google Gemini uses contents with parts. Cohere uses chat_history.

  • Parameter names and ranges differ. The output-length cap is max_tokens, max_output_tokens, or maxOutputTokens depending on the provider. temperature keeps the same name everywhere, but its valid range does not (0-1 for some providers, 0-2 for others).

  • Response shapes differ. Choices arrays, content blocks, candidates — every provider returns differently.

  • Streaming protocols differ. SSE event shapes vary, and so does chunk granularity: you cannot assume that one chunk equals one token.

  • Error semantics differ. Rate limits, content filters, context-length errors all surface differently.

If you write directly against any one SDK, you have written code that knows about all of these specifics. Migrating it means changing every assumption, in every file.
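To make the message-structure difference concrete, here is a minimal sketch that builds two request payloads from the same conversation. The dict shapes are simplified illustrations rather than exact wire formats, and the function names `to_openai_payload` and `to_anthropic_payload` are hypothetical.

```python
from typing import Any


def to_openai_payload(messages: list[dict[str, str]], max_tokens: int) -> dict[str, Any]:
    """OpenAI-style shape: the system prompt is just another message."""
    return {"messages": list(messages), "max_tokens": max_tokens}


def to_anthropic_payload(messages: list[dict[str, str]], max_tokens: int) -> dict[str, Any]:
    """Anthropic-style shape: the system prompt is a top-level field."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    payload: dict[str, Any] = {
        "messages": [m for m in messages if m["role"] != "system"],
        "max_tokens": max_tokens,
    }
    if system_parts:
        # Assumption for this sketch: multiple system messages are joined.
        payload["system"] = "\n\n".join(system_parts)
    return payload


conversation = [
    {"role": "system", "content": "Summarize the following text in 2-3 sentences."},
    {"role": "user", "content": "Some long document..."},
]
openai_style = to_openai_payload(conversation, max_tokens=256)
anthropic_style = to_anthropic_payload(conversation, max_tokens=256)
```

The same two-message conversation ends up as two messages in one payload and as one message plus a separate field in the other. Code written against either shape directly has that assumption baked in.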

The pattern: a thin internal abstraction

The fix is not a heavy framework. It is a thin abstraction layer that captures only what your application actually uses, expressed in your application's vocabulary.

Here is what mine looks like, stripped to essentials:

# llm/types.py
from dataclasses import dataclass
from typing import Literal, Optional
from enum import Enum

class Role(Enum):
    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"

@dataclass
class Message:
    role: Role
    content: str

@dataclass
class CompletionRequest:
    messages: list[Message]
    max_tokens: int = 1024
    temperature: float = 0.3
    stop_sequences: Optional[list[str]] = None

@dataclass
class CompletionResponse:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    finish_reason: Literal["complete", "max_tokens", "stop", "content_filter", "error"]

That is the contract. Notice what is not in it: no provider-specific concepts, no SDK types, no leaked abstractions. Your application code speaks this vocabulary and only this vocabulary.

The provider implementations sit behind a single interface:

# llm/client.py
from abc import ABC, abstractmethod
from typing import Iterator  # needed for the stream() return annotation

from llm.types import CompletionRequest, CompletionResponse

class LLMClient(ABC):
    @abstractmethod
    def complete(self, request: CompletionRequest) -> CompletionResponse:
        ...

    @abstractmethod
    def stream(self, request: CompletionRequest) -> Iterator[str]:
        ...

And then one adapter per provider. The adapters are the only place provider-specific logic exists.

# llm/providers/openai_client.py
import openai
from llm.client import LLMClient
from llm.types import CompletionRequest, CompletionResponse, Role

class OpenAIClient(LLMClient):
    def __init__(self, api_key: str, model: str = "gpt-4o"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def complete(self, request: CompletionRequest) -> CompletionResponse:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": m.role.value, "content": m.content}
                for m in request.messages
            ],
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            stop=request.stop_sequences
        )

        return CompletionResponse(
            # message.content is Optional in the SDK; normalize None to "".
            content=response.choices[0].message.content or "",
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens,
            model=response.model,
            finish_reason=self._normalize_finish(response.choices[0].finish_reason)
        )

    def _normalize_finish(self, reason: str) -> str:
        return {
            "stop": "complete",
            "length": "max_tokens",
            "content_filter": "content_filter"
        }.get(reason, "error")

    def stream(self, request: CompletionRequest):
        stream = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": m.role.value, "content": m.content} for m in request.messages],
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            stop=request.stop_sequences,  # keep streaming consistent with complete()
            stream=True
        )
        for chunk in stream:
            # Some chunks can arrive with an empty choices list, so guard the index.
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

The Anthropic adapter looks structurally identical but handles the system-message separation and the content-block response shape:

# llm/providers/anthropic_client.py
import anthropic
from llm.client import LLMClient
from llm.types import CompletionRequest, CompletionResponse, Role

class AnthropicClient(LLMClient):
    def __init__(self, api_key: str, model: str = "claude-opus-4-7"):
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

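    def complete(self, request: CompletionRequest) -> CompletionResponse:
        # Anthropic takes the system prompt as a top-level parameter,
        # not as a message in the conversation.
        system_msg = next(
            (m.content for m in request.messages if m.role == Role.SYSTEM),
            None
        )
        conversation = [
            {"role": m.role.value, "content": m.content}
            for m in request.messages if m.role != Role.SYSTEM
        ]

        # Omit optional parameters that are unset instead of passing None.
        optional = {}
        if system_msg is not None:
            optional["system"] = system_msg
        if request.stop_sequences:
            optional["stop_sequences"] = request.stop_sequences

        response = self.client.messages.create(
            model=self.model,
            messages=conversation,
            max_tokens=request.max_tokens,
            temperature=request.temperature,
            **optional
        )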

        return CompletionResponse(
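            # The response is a list of content blocks; join the text ones.
            content="".join(b.text for b in response.content if b.type == "text"),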
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            model=response.model,
            finish_reason=self._normalize_finish(response.stop_reason)
        )

    def _normalize_finish(self, reason: str) -> str:
        return {
            "end_turn": "complete",
            "max_tokens": "max_tokens",
            "stop_sequence": "stop"
        }.get(reason, "error")

    def stream(self, request: CompletionRequest):
        # similar pattern, handling Anthropic's event stream
        ...

Now your application code looks like this:

# app/summarize.py
from llm.client import LLMClient
from llm.types import CompletionRequest, Message, Role

def summarize(client: LLMClient, text: str) -> str:
    request = CompletionRequest(
        messages=[
            Message(Role.SYSTEM, "Summarize the following text in 2-3 sentences."),
            Message(Role.USER, text)
        ],
        temperature=0.3
    )
    response = client.complete(request)
    return response.content

The application code does not know or care which provider is behind client. Swap providers by changing one wiring line at startup.
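One way that wiring line could look is a small factory keyed on configuration. This is a sketch under assumptions: the `LLM_PROVIDER` environment variable, the `build_client` function, the registry, and the offline `EchoClient` are all hypothetical names, and the interface here is a trimmed stand-in for the one defined above.

```python
import os
from abc import ABC, abstractmethod
from typing import Callable, Optional


class LLMClient(ABC):
    """Trimmed stand-in for the LLMClient interface defined earlier."""

    @abstractmethod
    def complete(self, request: object) -> object:
        ...


class EchoClient(LLMClient):
    """Hypothetical offline client, handy for tests and local development."""

    def complete(self, request: object) -> object:
        return request


# Hypothetical registry: config value -> zero-argument client constructor.
# A real entry would construct an adapter, for example
#   "openai": lambda: OpenAIClient(api_key=os.environ["OPENAI_API_KEY"])
_REGISTRY: dict[str, Callable[[], LLMClient]] = {
    "echo": EchoClient,
}


def build_client(provider: Optional[str] = None) -> LLMClient:
    """Pick the client once at startup; application code only sees LLMClient."""
    name = provider or os.environ.get("LLM_PROVIDER", "echo")
    if name not in _REGISTRY:
        raise ValueError(f"Unknown LLM provider: {name!r}")
    return _REGISTRY[name]()


client = build_client("echo")
```

Failing loudly on an unknown provider name is deliberate: a typo in configuration should stop startup rather than silently fall back to a default provider.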

Where teams push back, and why I do this anyway

Reasonable engineers raise three objections to this pattern. They are worth taking seriously.

"This is over-engineering. YAGNI."

The YAGNI argument is the strongest one. For a single-feature integration in a hackathon project, yes, write directly against the SDK. For an enterprise application where AI features will spread across many services over multiple years, the abstraction pays for itself the first time you need to: switch providers for cost reasons, add a second provider for redundancy, route different requests to different providers based on data classification, swap to an on-prem model for sensitive workloads, or comply with a new procurement requirement.

I have watched all five of these happen at companies in the past two years. Teams that built behind an abstraction adapted in days. Teams that did not are still mid-migration.

"Provider-specific features are valuable. I lose them with an abstraction."

Partially true. The abstraction captures the common 80% of functionality. Provider-specific features (Anthropic's prompt caching, OpenAI's function calling specifics, Gemini's grounding) need either provider-specific escape hatches in the interface or specialized clients used directly where needed.

My rule: if a feature is core to the application, build it into the abstraction with provider-specific implementations behind it. If it is experimental or used in one place, allow direct SDK access for that one use case and accept it as an explicit migration cost later.
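For the escape-hatch case, one option is a per-provider options bag on the request that each adapter merges into its own call. This is a sketch, not the contract shown earlier: the `provider_options` field, the `build_call_kwargs` helper, and the provider and option names are hypothetical, and the request type is trimmed for brevity.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CompletionRequest:
    """Trimmed copy of the request type, plus a hypothetical escape hatch."""
    prompt: str
    max_tokens: int = 1024
    # Provider name -> extra keyword arguments meant for that provider only.
    provider_options: dict[str, dict[str, Any]] = field(default_factory=dict)


def build_call_kwargs(provider: str, request: CompletionRequest) -> dict[str, Any]:
    """Merge portable parameters with the options addressed to this provider."""
    kwargs: dict[str, Any] = {"max_tokens": request.max_tokens}
    # Options addressed to other providers are ignored by this adapter.
    kwargs.update(request.provider_options.get(provider, {}))
    return kwargs


request = CompletionRequest(
    prompt="Summarize this.",
    provider_options={"provider_a": {"example_option": True}},
)
kwargs_a = build_call_kwargs("provider_a", request)
kwargs_b = build_call_kwargs("provider_b", request)
```

Keying the bag by provider name keeps the migration cost visible: a search for `provider_options` lists every place that depends on a provider-specific feature.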

"Frameworks like LangChain already do this."

They do. They also bring opinionated abstractions about chains, agents, memory, and tooling that you may not want. For most enterprise use cases I see, a thin internal abstraction is easier to reason about, easier to debug, and has zero external dependencies beyond the underlying provider SDKs.

LangChain has its place, particularly for prototyping and for teams building agentic workflows from scratch. For straightforward request-response LLM integration in business applications, a small internal module of a few hundred lines is almost always the better answer.

What this enables operationally

Once the abstraction is in place, several capabilities become easy:

Centralized observability. Wrap every complete() call with logging — token usage, latency, model, request hash, response excerpt. Now you have provider-neutral telemetry across every AI feature in the application.
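A minimal sketch of such a wrapper, assuming the standard library `logging` module as the sink. The `LoggingClient` name and the log line format are hypothetical, and the response type is a trimmed stand-in for the normalized one.

```python
import hashlib
import logging
import time
from dataclasses import dataclass

logger = logging.getLogger("llm")


@dataclass
class CompletionResponse:
    """Trimmed stand-in for the normalized response type."""
    content: str
    input_tokens: int
    output_tokens: int
    model: str


class LoggingClient:
    """Wraps any object with a complete() method and logs one line per call."""

    def __init__(self, inner, excerpt_chars: int = 80):
        self.inner = inner
        self.excerpt_chars = excerpt_chars

    def complete(self, request) -> CompletionResponse:
        # Hash the request so identical prompts can be correlated without
        # writing the full prompt text to the logs.
        request_hash = hashlib.sha256(repr(request).encode("utf-8")).hexdigest()[:12]
        started = time.perf_counter()
        response = self.inner.complete(request)
        latency_ms = (time.perf_counter() - started) * 1000
        logger.info(
            "llm.complete model=%s input_tokens=%d output_tokens=%d "
            "latency_ms=%.1f request_hash=%s excerpt=%r",
            response.model, response.input_tokens, response.output_tokens,
            latency_ms, request_hash, response.content[: self.excerpt_chars],
        )
        return response
```

Because the wrapper only depends on `complete()`, it can sit in front of any adapter. Whether a response excerpt belongs in logs at all depends on your data handling rules, so treat that field as optional.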

Rate limit and retry handling. Implement once in the base client. Every adapter inherits it.
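A sketch of retry with exponential backoff and jitter. It uses a wrapper rather than base-class inheritance, which is a swapped-in design for brevity, and it assumes each adapter translates provider rate-limit errors into one normalized exception. `TransientLLMError` and `RetryingClient` are hypothetical names.

```python
import random
import time
from typing import Callable


class TransientLLMError(Exception):
    """Hypothetical normalized error that adapters raise for retryable failures."""


class RetryingClient:
    """Retries complete() on transient errors with exponential backoff."""

    def __init__(
        self,
        inner,
        max_attempts: int = 3,
        base_delay: float = 0.5,
        sleep: Callable[[float], None] = time.sleep,
    ):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.inner = inner
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests do not actually wait

    def complete(self, request):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.inner.complete(request)
            except TransientLLMError:
                if attempt == self.max_attempts:
                    raise  # out of attempts: surface the error to the caller
                # Delay doubles each attempt; jitter avoids synchronized retries.
                delay = self.base_delay * 2 ** (attempt - 1)
                self.sleep(delay + random.uniform(0, delay / 2))
```

Only the normalized transient error is retried. Content-filter and context-length errors should propagate immediately, since retrying them produces the same failure.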

Cost tracking. Token counts come back in a normalized format. Aggregate them per feature, per user, per tenant — without writing provider-specific accounting code.
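As an illustration, a small in-memory aggregator keyed by feature. The `CostTracker` class is hypothetical and the prices are placeholders, not real provider rates. A production version would persist to a metrics store rather than a dict.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0


class CostTracker:
    """Aggregates normalized token counts and estimated cost per feature."""

    def __init__(self, prices_per_million: dict[str, tuple[float, float]]):
        # model name -> (input price, output price) in USD per million tokens
        self.prices = prices_per_million
        self.by_feature: dict[str, Usage] = defaultdict(Usage)

    def record(self, feature: str, model: str, input_tokens: int, output_tokens: int) -> None:
        in_price, out_price = self.prices[model]
        usage = self.by_feature[feature]
        usage.input_tokens += input_tokens
        usage.output_tokens += output_tokens
        usage.cost_usd += (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Placeholder prices for a made-up model name.
tracker = CostTracker({"model-a": (1.00, 2.00)})
tracker.record("summarize", "model-a", input_tokens=1_000_000, output_tokens=500_000)
```

The same `record` call works for any provider because the token counts arrive through the normalized response. Swapping the `feature` key for a user or tenant ID gives the other breakdowns.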

A/B testing across providers. Route 10% of requests to a second provider and compare quality, latency, and cost on the same prompts. This is the highest-ROI activity I do with this abstraction — you find out empirically which model fits your workload, instead of guessing from marketing benchmarks.
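A sketch of a deterministic percentage split, assuming a stable routing key such as a user or request ID is available. `SplitClient` and `bucket` are hypothetical names, and `complete()` takes an extra `routing_key` argument here that is not part of the interface shown earlier.

```python
import hashlib


def bucket(key: str, buckets: int = 100) -> int:
    """Map a key to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


class SplitClient:
    """Routes a fixed percentage of routing keys to a candidate client."""

    def __init__(self, control, candidate, candidate_percent: int = 10):
        if not 0 <= candidate_percent <= 100:
            raise ValueError("candidate_percent must be between 0 and 100")
        self.control = control
        self.candidate = candidate
        self.candidate_percent = candidate_percent

    def pick(self, routing_key: str):
        # The same key always lands in the same bucket, so a given user
        # sees one provider consistently for the duration of the experiment.
        if bucket(routing_key) < self.candidate_percent:
            return self.candidate
        return self.control

    def complete(self, request, routing_key: str):
        return self.pick(routing_key).complete(request)
```

Hashing the key instead of calling a random number generator makes the assignment reproducible, which matters when you later join quality ratings back to the provider that produced each response.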

Data classification routing. Public-tier data goes to the cheapest provider; confidential data routes only to providers with appropriate data handling agreements. The routing logic lives in one place.
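A minimal sketch of that single place, assuming the caller already knows the classification of the data in the request. The `DataClass` tiers and `ClassificationRouter` are hypothetical names. The router fails closed: every tier must have an explicit route.

```python
from enum import Enum


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class ClassificationRouter:
    """Chooses a client by data classification, with no implicit default."""

    def __init__(self, routes: dict):
        # Fail closed: refuse to start if any classification lacks a route,
        # so confidential data can never fall through to a default provider.
        missing = set(DataClass) - set(routes)
        if missing:
            names = sorted(c.name for c in missing)
            raise ValueError(f"No route configured for: {names}")
        self.routes = dict(routes)

    def client_for(self, classification: DataClass):
        return self.routes[classification]

    def complete(self, request, classification: DataClass):
        return self.client_for(classification).complete(request)
```

Validating the route table at construction time moves a misconfiguration from a production incident to a startup failure.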

What it costs to build

For a Python codebase, this pattern is about 400–600 lines of code total: the types module, the client interface, two or three provider adapters, and tests. Implementation time for an experienced engineer: roughly two days.

For that two-day investment, you get vendor neutrality, centralized observability, and a clean migration path for the life of the application. The math has never been close, in my experience.

Where to start

If your codebase already has direct SDK calls scattered through it, do not try to refactor all of them at once. The approach I recommend:

  1. Build the abstraction and the two adapters you care about most

  2. Wrap new AI features behind the abstraction starting immediately

  3. Refactor existing direct calls opportunistically — when you touch the file for another reason

  4. Set a date six months out by which all AI integration goes through the abstraction

  5. Add a lint rule preventing direct SDK imports in application code
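The lint rule in step 5 can start as a short script built on the standard library `ast` module. This is a sketch, not a plugin for any particular linter. The `find_direct_sdk_imports` function and the banned set are hypothetical, and the check would need to skip the adapter modules themselves.

```python
import ast

# Top-level SDK packages that application code should not import directly.
BANNED_SDKS = frozenset({"openai", "anthropic"})


def find_direct_sdk_imports(source: str, banned=BANNED_SDKS) -> list[tuple[int, str]]:
    """Return (line number, module name) for each banned import in the source."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare only the top-level package: "openai.types" counts as "openai".
            if name.split(".")[0] in banned:
                violations.append((node.lineno, name))
    return sorted(violations)


sample = (
    "import openai\n"
    "from anthropic import Anthropic\n"
    "from llm.client import LLMClient\n"
)
violations = find_direct_sdk_imports(sample)
```

Run over every file outside the providers directory in CI, a non-empty result fails the build, which is what keeps step 4's deadline from slipping.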

Within two quarters, your codebase is vendor-neutral by default, and you have an organizational capability that will compound for years.


Adeel Ali is a Technology Manager based in Falls Church, Virginia, working with enterprise IT teams across the Northern Virginia tech corridor on AI governance, cloud strategy, and IT modernization. He writes at adeelali.substack.com and shares templates on GitHub.