LLM Client Architecture
The AI service uses an abstract LLMClient interface with concrete implementations for Anthropic (primary) and OpenAI (fallback), plus a CachingLLMClient decorator for semantic response caching.
Architecture
CachingLLMClient (decorator)
└── AnthropicClient (primary)
└── fallback → OpenAIClient
Abstract Interface
# services/ai/src/gospelib_ai/llm/client.py
from abc import ABC, abstractmethod
class LLMClient(ABC):
@abstractmethod
async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
"""Send a completion request to the LLM provider."""
...
All LLM interactions go through this interface, making it straightforward to swap providers or add new ones.
Anthropic Client (Primary)
class AnthropicClient(LLMClient):
def __init__(self, api_key: str, model: str = "claude-sonnet-4-20250514"):
self._client = AsyncAnthropic(api_key=api_key)
self._model = model
async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
response = await self._client.messages.create(
model=self._model,
max_tokens=max_tokens,
system=system,
messages=messages,
)
return response.content[0].text
- Uses
AsyncAnthropicfor non-blocking I/O - Default model:
claude-sonnet-4-20250514(configurable via env var) - The system prompt contains scholarly context and style guidelines
OpenAI Client (Fallback)
The OpenAI client follows the same interface, activated when the Anthropic client is unavailable or returns errors.
Caching Decorator
class CachingLLMClient(LLMClient):
"""Semantic caching — avoids redundant LLM calls for similar prompts."""
def __init__(self, inner: LLMClient, redis_client, ttl: int = 3600):
self._inner = inner
self._redis = redis_client
self._ttl = ttl
async def complete(
self,
system: str,
messages: list[dict],
max_tokens: int = 1024,
) -> str:
cache_key = self._hash(system, messages)
if cached := await self._redis.get(f"gl:ai:cache:{cache_key}"):
log.info("llm_cache_hit", cache_key=cache_key)
return cached.decode()
result = await self._inner.complete(system, messages, max_tokens)
await self._redis.setex(
f"gl:ai:cache:{cache_key}", self._ttl, result
)
return result
How Caching Works
- Hash the prompt — The system prompt and messages are hashed to create a deterministic cache key
- Check Redis — Look up
gl:ai:cache:<hash>in Redis - On hit — Return cached response immediately (no LLM call)
- On miss — Call the inner LLM client, cache the result, then return it
Cache Key Format
gl:ai:cache:<sha256(system + messages)> → TEXT (TTL: 3600s)
Why Cache?
- LLM API calls cost money and take 1–5 seconds
- Many users ask the same questions about popular passages
- Scripture content is immutable — explanations for the same passage + context don't change
- The 1-hour TTL balances freshness with cost savings
Composing the Client
# Startup configuration
anthropic = AnthropicClient(api_key=settings.anthropic_api_key)
cached_client = CachingLLMClient(anthropic, redis_client, ttl=3600)
# Use cached_client for all LLM interactions
response = await cached_client.complete(system_prompt, messages)
The decorator pattern allows adding caching without modifying any provider implementation.
Related Pages
- AI Service Overview — service configuration and setup
- Prompt Templates — Jinja2 template system
- Architecture > Data > Redis — Redis caching details