LLM Client Architecture

The AI service uses an abstract LLMClient interface with concrete implementations for Anthropic (primary) and OpenAI (fallback), plus a CachingLLMClient decorator for semantic response caching.

Architecture

CachingLLMClient (decorator)
  └── AnthropicClient (primary)
       └── fallback → OpenAIClient

Abstract Interface

# services/ai/src/gospelib_ai/llm/client.py
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    async def complete(
        self,
        system: str,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> str:
        """Send a completion request to the LLM provider."""
        ...

All LLM interactions go through this interface, making it straightforward to swap providers or add new ones.

Anthropic Client (Primary)

class AnthropicClient(LLMClient):
    def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
        self._client = AsyncAnthropic(api_key=api_key)
        self._model = model

    async def complete(
        self,
        system: str,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> str:
        response = await self._client.messages.create(
            model=self._model,
            max_tokens=max_tokens,
            system=system,
            messages=messages,
        )
        return response.content[0].text

Uses AsyncAnthropic for non-blocking I/O
Default model: claude-sonnet-4-20250514 (configurable via env var)
The system prompt contains scholarly context and style guidelines

OpenAI Client (Fallback)

The OpenAI client follows the same interface, activated when the Anthropic client is unavailable or returns errors.

Caching Decorator

class CachingLLMClient(LLMClient):
    """Semantic caching — avoids redundant LLM calls for similar prompts."""

    def __init__(self, inner: LLMClient, redis_client, ttl: int = 3600):
        self._inner = inner
        self._redis = redis_client
        self._ttl = ttl

    async def complete(
        self,
        system: str,
        messages: list[dict],
        max_tokens: int = 1024,
    ) -> str:
        cache_key = self._hash(system, messages)
        if cached := await self._redis.get(f"gl:ai:cache:{cache_key}"):
            log.info("llm_cache_hit", cache_key=cache_key)
            return cached.decode()

        result = await self._inner.complete(system, messages, max_tokens)
        await self._redis.setex(
            f"gl:ai:cache:{cache_key}", self._ttl, result
        )
        return result

How Caching Works

Hash the prompt — The system prompt and messages are hashed to create a deterministic cache key
Check Redis — Look up gl:ai:cache:<hash> in Redis
On hit — Return cached response immediately (no LLM call)
On miss — Call the inner LLM client, cache the result, then return it

Cache Key Format

gl:ai:cache:<sha256(system + messages)>   →  TEXT   (TTL: 3600s)

Why Cache?

LLM API calls cost money and take 1–5 seconds
Many users ask the same questions about popular passages
Scripture content is immutable — explanations for the same passage + context don't change
The 1-hour TTL balances freshness with cost savings

Composing the Client

# Startup configuration
anthropic = AnthropicClient(api_key=settings.anthropic_api_key)
cached_client = CachingLLMClient(anthropic, redis_client, ttl=3600)

# Use cached_client for all LLM interactions
response = await cached_client.complete(system_prompt, messages)

The decorator pattern allows adding caching without modifying any provider implementation.

AI Service Overview — service configuration and setup
Prompt Templates — Jinja2 template system
Architecture > Data > Redis — Redis caching details

Architecture​

Abstract Interface​

Anthropic Client (Primary)​

OpenAI Client (Fallback)​

Caching Decorator​

How Caching Works​

Cache Key Format​

Why Cache?​

Composing the Client​

Related Pages​